1 Introduction

Since 2007, Airbnb has been playing a major role in the travel calendar. Furthermore, it has significantly changed the way people plan and arrange travel. As a unicorn that lets people make better use of their spare homes. Airbnb accelerates and benefits property usage while bringing convenience to travelers. However, legends can also have some weaknesses, especially today - many competitors could easily steal the market from the former giant.

Airbnb’s weaknesses mainly lie in user experience and service delivery.

From our personal experience, Airbnb is weak, especially with price comparisons between its own listings, a recommendation algorithm, and an annoying quick selector, not to mention no features to help you tailor your travel plans to your situation.

Compared with the traditional hotels, Airbnb’s unbranded listings make users diffcult to compare prices effectively, especially since much of the data that can be obtained from the API is not clearly reflected in the quick search interface, resulting in an unsimplified decision-making process.

In addition, UI problems are also obvious. The current quick selector is often difficult to use because of its long list and unclear icons. Additionally, Airbnb lacks comprehensive features to help travelers plan their trip based on their mood, budget, and preferred types of landscapes. Effective utilization of this market blue ocean will significantly improve user loyalty.

As the purpose of our project, we develop an Airbnb auxiliary plug-in to solve related problems, which mainly includes recommending travel destinations and accommodation through user data revenue combined with price predictions. When users search, the system generates recommended house prices for each listing result and gives suggestions.In order to handle the problems mentioned above, a predictive model is the first step.

2 Data Manipulation and Visualization

At this stage, we focus on data collection and cleaning. This includes raw data obtained from the Airbnb API (actually Kaggle for this assignment), including detailed information, reviews and brief information, etc.; population data obtained from CBS, city data obtained from the Amsterdam Municipality Geographic Information website and other data sets converted to Structured format suitable for analysis. This involves cleaning the data to eliminate inconsistencies and irrelevant information. We then conducted exploratory data analysis (EDA) to reveal initial insights and patterns.

2.1 House Price and General Characteristics for listings

In this section, we’ll dive into the specifics of pricing and the general characteristics of Airbnb listings.

(1) Loading and Filtering Data

We began by importing the Airbnb dataset, which contains detailed information about listings in Amsterdam. This dataset includes a wide array of variables such as price, location, etc.Geo-spatialized data was create for the mapping and data analysis at the next step.

# import planning district and housing data
district <- 
  st_read("https://raw.githubusercontent.com/ObjQIAN/Airbnb_AMS/main/data/neighbourhoods.geojson") %>%
 #dplyr::select(DIST_NAME,ABBREV) %>% #Select data for later prediction
  st_transform('EPSG:7415')



nhoods_0 <- 
  read.csv("https://raw.githubusercontent.com/ObjQIAN/Airbnb_AMS/main/data/listings.csv") %>%
  mutate( lat = latitude, lon = longitude)%>%
  st_as_sf(coords=c("longitude","latitude"), crs=4326) %>%
   st_transform('EPSG:7415')


nhoods <- st_join(nhoods_0, district)
nhoods <-  transform(nhoods, price = as.numeric(price), id = as.numeric(id)) %>%
  filter(price!= 0 & price > 40 & price < 300) %>%
  st_intersection(district)

 
## do not run
 # listing_details <- lapply(listing_details, function(x) {x[x == ""] <- NA})%>%
#    as_data_frame()%>%
 #   na.omit()
    
sentiment_list<- 
  read.csv("https://raw.githubusercontent.com/ObjQIAN/Airbnb_AMS/main/data/sentiment_list.csv")
  
nhoods <- inner_join(nhoods,sentiment_list, by = 'id')

(2) Listings and Price distribution in Amsterdam

By plotting listing density, prices and more on a map of Amsterdam, we were able to observe geographical patterns in pricing and property types - with city centers having significantly more densely populated Airbnbs. And their average prices are also relatively higher - which is consistent with the typical distribution of travel cities. Spatial analysis also reveals potential correlations between location and price, which indicates that we should subsequently conduct Spatial Lag analysis to enhance the predictive power of the model.

#### (1) Loading and Filtering Data

ggplot() +
  geom_sf(data = district, fill = NA)+
  geom_hex(data = nhoods, aes(x = st_coordinates(nhoods)[,1], y = st_coordinates(nhoods)[,2]),binwidth = c(300,300)) +
  #coord_fixed() +  
  labs(title = "Listing density Hexbin map in Amsterdam",
       subtitle = "Based on 2019 data",
       x = "longitude",
       y = "latitude") +
  scale_fill_viridis()+
  theme_light()

ggplot() +
  geom_sf(data = district, fill = "grey90", alpha = 0.5) +
  geom_sf(data = nhoods, aes(colour = q5(price)), 
          show.legend = "point", size = .55) +
  scale_colour_manual(values = palette5,
                   labels=qBr(nhoods,"price"),
                   name="Price") +
  labs(title="Airbnb Listed Price",
       subtitle = "Amsterdam, 2019") +
  mapTheme()

district_data <- nhoods%>%
  group_by(neighbourhood)%>%
  summarise(average_price = mean(price))%>%
  st_drop_geometry()%>%
  left_join(.,district)%>%
  st_as_sf()




ggplot() +
  geom_sf(data = district_data, aes(fill = average_price), alpha = 0.8) +
  scale_fill_viridis_c() +
  geom_sf(data = nhoods, size = 0.1, color = "#FFD700", alpha = 0.1, show.legend = FALSE) +
  labs(title = "Airbnb Average Listed Price by Neighbourhood"       ,
       subtitle = "Amsterdam, 2019") +
  mapTheme()

2.2 Listing Details and House-Specific Data

This section turns to examining the specifics of individual listings and the analysis of specific housing data. Doing so reveals the unique features of each property, thereby increasing the amount of data and improving the accuracy of the model.

(1) Loading and Filtering Data

We mainly get the data through the listing_details files - they are all accessible through the Airbnb API

# import listing details data
listing_details<- read.csv("https://raw.githubusercontent.com/ObjQIAN/Airbnb_AMS/main/data/listings_details.csv",na.strings = c("", "NA")) %>%   
  st_as_sf(coords=c("longitude","latitude"), crs=4326) %>%
   st_transform('EPSG:7415') %>%
  select(id,host_id,host_since,host_response_time,host_response_rate,host_listings_count,host_verifications,host_has_profile_pic,is_location_exact,property_type,room_type,accommodates,bathrooms,bedrooms,beds,bed_type,amenities,cleaning_fee,extra_people,minimum_nights,maximum_nights,cancellation_policy,review_scores_rating,reviews_per_month,price ) %>%
  filter( beds< 10)

# 
# ==> 1. host_id:continuous numeric value
# ==> 3. minimum_nights:continuous numeric value
# ==> 4. maximum_nights:continuous numeric value
# ==> 5. accommodates:continuous numeric value
# ==> 7. host_has_profile_pic:2 categories
# ==> 8. is_location_exact:2 categories
# ==> 9. room_type:3 categories
# ==> 10. cancellation_policy:4 categories
# ==> 11. bed_type:5 categories
# ==> 12. property_type:31 categories

# date time calculation
listing_details$host_since <- as.Date(listing_details$host_since, format = "%Y-%m-%d")
target_date <- as.Date("2019-12-31")
listing_details$host_since_days <-  as.numeric(target_date - listing_details$host_since)

#===> Part 2 | 2 columns need to change specific "N/A" value
listing_details<- listing_details%>% mutate(host_response_time = ifelse(host_response_time == 'N/A', 'unknown', host_response_time))
 
listing_details$price <- as.numeric(gsub("\\$", "", listing_details$price))
listing_details$host_response_rate <- as.numeric(gsub("%", "", listing_details$host_response_rate))

listing_details <- listing_details %>% 
    mutate(
        host_response_rate = case_when(
            host_response_rate %in% 0:1 ~ '~0%',
            host_response_rate %in% 2:25 ~ '1-25%',
            host_response_rate %in% 26:35 ~ '26-35%',
            host_response_rate %in% 36:45 ~ '36-45%',
            host_response_rate %in% 46:55 ~ '46-55%',
            host_response_rate %in% 56:70 ~ '56-70%',
            host_response_rate %in% 71:79 ~ '70-79%',
            host_response_rate %in% 80:85 ~ '80-85%',
           
             host_response_rate %in% 86:90 ~ '86-90%',
            host_response_rate %in% 91:95 ~ '91-95%',
            host_response_rate %in% 96:98 ~ '96-98%',
            host_response_rate %in% 99:100 ~ '99-100%',
            is.na(host_response_rate) ~ 'no data')) 
  
  
# ==> Part 3 | 4 columns need to delete a few NA values  
listing_details <-  listing_details[!is.na(listing_details$host_since) &
                !is.na(listing_details$bathrooms) & 
                !is.na(listing_details$bedrooms) & 
                !is.na(listing_details$beds), ]

#======> Part 4 | 3 columns need to delete nearly 2400 in total NA values
listing_details$cleaning_fee <- 
  as.numeric(gsub("\\$", "", listing_details$cleaning_fee))

listing_details$cleaning_fee[is.na(listing_details$cleaning_fee)] <- 0

listing_details$extra_people <- 
  as.numeric(gsub("\\$", "", listing_details$extra_people))

listing_details$extra_people[is.na(listing_details$extra_people)] <- 0
listing_details$reviews_per_month[is.na(listing_details$reviews_per_month)] <- 0


# ==> Part 5: array columns

listing_details$host_veri_length <-
    sapply(listing_details$host_verification, 
           function(x) length(strsplit(gsub("\\[|\\]|'", "", x),
                                       ",\\s*")[[1]]))

listing_details$amenities_lengths <- sapply(listing_details$amenities, function(x) {
    cleaned_content <- gsub("^\\{|\\}$", "", x)
    elements <- strsplit(cleaned_content, ",(?=([^\"]*\"[^\"]*\")*[^\"]*$)", perl = TRUE)[[1]]
    elements <- trimws(gsub("^\"|\"$", "", elements))
    length(elements)
})

listing_details <- listing_details %>%
  select(-amenities,-host_verifications,-host_since) %>%
  filter(minimum_nights<32 & bathrooms < 11) %>%
  na.omit()
column_names_list <- names(listing_details)
column_names_list <- paste(column_names_list, collapse = ",")
listing_details_numeric <- sapply(listing_details, is.numeric)
listing_details_numeric <- listing_details[,listing_details_numeric]%>% st_drop_geometry()
listing_details_categorical <- sapply(listing_details, function(x) is.factor(x) || is.character(x))
listing_details_categorical["price"] <- TRUE
listing_details_categorical<- listing_details[,listing_details_categorical]%>% st_drop_geometry()


#column_names_list
#id,host_id,host_response_time,host_response_rate,host_listings_count,host_has_profile_pic,is_location_exact,property_type,room_type,accommodates,bathrooms,bedrooms,beds,bed_type,cleaning_fee,extra_people,minimum_nights,maximum_nights,cancellation_policy,review_scores_rating,reviews_per_month,host_since_days,host_veri_length,amenities_lengths

(2) Analyse Data Correlation

We focus on understanding the correlation between various variables from the Airbnb listing details and the prices, as well as in between them.

To make the analysis easier, all the data were divided into categorical data and numeric data.

For the categorical data, We made plots to show either the “Price as a function of categorical variables from Airbnb details” and “Categorical variables amounts distribution from Airbnb details”. By doing this, we can find out the relationship behand categories, and we select cancellation_policy,bed_type,property_type,room_type as varibles for modeling.

st_drop_geometry(listing_details_categorical) %>% 
  gather(Variable, Value, -price) %>% 
  ggplot(aes(Value, price)) +
     geom_histogram(data = . %>% filter(price >0),stat = 'identity', fill = "#2A9D8F") +
     facet_wrap(~Variable, ncol = 3, scales = "free") +
     labs(title = "Categorical variables amounts distribution from Airbnb details") +
     plotTheme()

st_drop_geometry(listing_details_categorical) %>% 
  gather(Variable, Value, -price) %>% 
  ggplot(aes(x = Value, y = price)) +
   stat_summary(fun = mean, geom = "bar", fill = "#2A9D8F") +
     facet_wrap(~Variable, ncol = 3, scales = "free") +
     labs(title = "Price as a function of categorical variables from Airbnb details") +
     plotTheme()

# select cancellation_policy,bed_type,property_type,room_type

For the numeric data, we furthermore used ggcorrplot to analyse the relationship among them, and find out that accommodates,bathrooms,bedrooms,cleaning_fee,amenities_lengths,extra_people,reviews_per_month have significant impact on the price.

st_drop_geometry(listing_details_numeric) %>% 
  gather(Variable, Value, -price) %>% 
   ggplot(aes(Value, price)) +
     geom_point(size = .5, color = "#2A9D8F", alpha = 0.45) + 
    geom_smooth(data = . %>% filter(price >0), method = "lm", se=F, colour = "#005B96") +
     facet_wrap(~Variable, ncol = 4, scales = "free") +
     labs(title = "Price as a function of continuous variables from Airbnb details") +
     plotTheme()

ggcorrplot(
  round(cor(listing_details_numeric%>%st_drop_geometry()%>%na.omit()), 1), 
  p.mat = cor_pmat(listing_details_numeric%>%st_drop_geometry()%>%na.omit()),
  colors = c("#E63946", "white", "#2A9D8F"),
  type="lower",
  insig = "blank",
  lab = TRUE) +  
    labs(title = "Correlation across listing details") 

  #select accommodates,bathrooms,bedrooms,cleaning_fee,amenities_lengths,extra_people,reviews_per_month

listing_details <- listing_details %>%
  select(-price) 

2.3 Sentiment Analysis Result from Previous Review Data

In this section, we delve into the sentiment analysis of previous review data for Airbnb listings. Sentiment analysis is a powerful tool to that shows more detail rather than the review star. By analyzing this, we can fit the moods pattern to future functions of the recommendation algorithm and it can offer valuable insights into how guests perceive different aspects of their stay.

We conduct the analysis in Python using the NLTK and BLOB package, and get all the data back to R by exporting a CSV file. The Polarity is the major factor that is brought into the model building.

sentiment_res<-nhoods%>%st_drop_geometry()%>%select(polarity_name,polarity_overview,polarity_rules,review_polarity,price)%>%na.omit()

ggcorrplot(
  round(cor(sentiment_res), 1), 
  p.mat = cor_pmat(sentiment_res),
  colors = c("#E63946", "white", "#2A9D8F"),
  type="lower",
  insig = "blank",
  lab = TRUE) +  
    labs(title = "Correlation across sentiment results") 

sentiment_res %>% 
  gather(Variable, Value, -price) %>% 
   ggplot(aes(Value, price)) +
     geom_point(size = .5, color = "#2A9D8F", alpha = 0.45) + 
    geom_smooth(data = . %>% filter(price >0), method = "lm", se=F, colour = "#005B96") +
     facet_wrap(~Variable, ncol = 4, scales = "free") +
     labs(title = "Price as a function of Sentiment Result") +
     plotTheme()

#SELECT Review_Polarity,id

2.3 Neighbourhood & Property Characteristics

Thanks to the GIS Portal from the city of Amsterdam, we could get access to a bunch of useful data

-   `high-rise`: high-rise buildings built in each tract
-   `green-roof`: green roofs with its surface in each tract
-   `wall art`: number of wall art in each tract
-   `market`: Number of market in each tract
-   `swimming water`: Number of swimming pool in each tract
-   `tram_metro`: trams and metros in the whole city
-   `flood`:flood area in each tract
-   `parking`: parking lot in each tract
# 1.Load high-rise data with 3 attributes: height, year, and geometry
highrise.sf <- 
  st_read("https://maps.amsterdam.nl/open_geodata/geojson_lnglat.php?KAARTLAAG=HOOGBOUW_PUNT&THEMA=hoogbouw") %>%
  dplyr::select(Hoogte, Jaar, geometry) %>% 
  rename(
    height = Hoogte,
    year = Jaar
  )%>%
  st_transform('EPSG:7415')

# 2.Load green roof data with 2 attributes
greenroof.sf <- 
  st_read("https://maps.amsterdam.nl/open_geodata/geojson_lnglat.php?KAARTLAAG=GROENE_DAKEN&THEMA=groene_daken") %>%
  dplyr::select(Oppervlakte_m2, geometry) %>% 
  rename(surface = Oppervlakte_m2)%>%
  st_transform('EPSG:7415')

# 3.Load wall art data
wallart.sf <- st_read('https://maps.amsterdam.nl/open_geodata/geojson_lnglat.php?KAARTLAAG=WANDKUNST&THEMA=wandkunst')%>%
  dplyr::select(geometry) %>%
  st_transform('EPSG:7415')

# 4.Load market data
market.sf <- st_read('https://maps.amsterdam.nl/open_geodata/geojson_lnglat.php?KAARTLAAG=MARKTEN&THEMA=markten')%>%
  dplyr::select(geometry) %>% 
  st_transform('EPSG:7415')

# 5.Load swimming water data
swimming.sf <- st_read('https://maps.amsterdam.nl/open_geodata/geojson_lnglat.php?KAARTLAAG=ZWEMWATER&THEMA=zwemwater')%>%
  dplyr::select(geometry) %>% 
  st_transform('EPSG:7415')

# 6.Load tram and metro data
tram_metro.sf <- st_read('https://maps.amsterdam.nl/open_geodata/geojson_lnglat.php?KAARTLAAG=TRAMMETRO_PUNTEN_2019&THEMA=trammetro')%>%
  dplyr::select( geometry) %>% 
  st_transform('EPSG:7415')

# 7.Load flood data
flood.sf <- st_read('https://maps.amsterdam.nl/open_geodata/geojson_lnglat.php?KAARTLAAG=KLIMAAT_OVERSTROMING&THEMA=klimaatadaptatie')%>%
  dplyr::select(geometry) %>%
  st_transform('EPSG:7415')

# 8.Load parking pressure data
parking.sf <- st_read('https://maps.amsterdam.nl/open_geodata/geojson_lnglat.php?KAARTLAAG=PARKEERDRUK_BUURTEN&THEMA=parkeerdruk')%>%
  dplyr::select(geometry) %>% 
  st_transform('EPSG:7415')

# 9.Heating supply
heating.sf <-  st_read('https://maps.amsterdam.nl/open_geodata/geojson_lnglat.php?KAARTLAAG=STADSWARMTEKOUDE_WIJK&THEMA=stadswarmtekoude')%>%
  dplyr::select(geometry) %>% 
  st_transform('EPSG:7415')

# 10 parks
 # https://maps.amsterdam.nl/open_geodata/geojson_lnglat.php?KAARTLAAG=PARKPLANTSOENGROEN&THEMA=stadsparken

# 11 green project
 # https://maps.amsterdam.nl/open_geodata/geojson_lnglat.php?KAARTLAAG=GROENPROJECTEN&THEMA=groenprojecten
  
  #12. compost
 # https://maps.amsterdam.nl/open_geodata/geojson_lnglat.php?KAARTLAAG=BUURTCOMPOST&THEMA=buurtcompost

(1) Adding Housing Value and Age Data

The regional housing value and house age data could also be good indicator of the Airbnb listing prices.

housing_value <- 
  st_read("https://raw.githubusercontent.com/ObjQIAN/Airbnb_AMS/main/data/housing_value_ams.geojson") %>%
   st_transform('EPSG:7415') %>%
  select(AvrVal)

housing_value$ID0 <- seq.int(nrow(housing_value))
housing_value_df <- housing_value %>% st_drop_geometry()

temp_nearest <- st_nearest_feature(nhoods,housing_value) %>% as_data_frame()%>%rename(ID0 = value)%>%left_join(housing_value_df, by = 'ID0')

nhoods <- cbind(nhoods, temp_nearest)
  #13. age
age.sf  <-  st_read('https://maps.amsterdam.nl/open_geodata/geojson_lnglat.php?KAARTLAAG=BOUWJAAR&THEMA=bouwjaar')%>%
  st_transform('EPSG:7415')%>%
  mutate(age = 2020-Bouwjaar)%>%
  select(age)
age.sf$ID1 <- seq.int(nrow(age.sf))
age_df <- age.sf %>% st_drop_geometry()

temp_near_age <- st_nearest_feature(nhoods,age.sf) %>% as_data_frame()%>%rename(ID1 = value)%>%left_join(age_df, by = 'ID1')


nhoods <- cbind(nhoods, temp_near_age)
nhoods%>%select(price,age,AvrVal)%>%st_drop_geometry()%>% na.omit()%>%
  gather(Variable, Value, -price) %>% 
   ggplot(aes(Value, price)) +
     geom_point(size = .5, color = "#2A9D8F", alpha = 0.45) + 
    geom_smooth(data = . %>% filter(price >0), method = "lm", se=F, colour = "#005B96") +
     facet_wrap(~Variable, ncol = 2, scales = "free") +
     labs(title = "Price as a function of Housing Property") +
    plotTheme()

2.4 Distance to Neighbourhood Properties Analysis

(1) Key Infrastractures

In this section, we examine the impact of proximity to critical infrastructure on the desirability and pricing of Airbnb properties. The infrastructure we selected is based on the variables imported above. We calculated the distance of each property from the nearest 1 to 5 key infrastructure through nn_function, and performed correlation analysis between the results and Price.

# 1.high-rise
nhoods$highrise.Buffer <- nhoods %>% 
    st_buffer(660) %>% 
    aggregate(mutate(highrise.sf, counter = 1),., sum) %>%
    pull(counter)
nhoods <-
  nhoods %>% 
  mutate(
    highrise_nn1 = nn_function(st_coordinates(st_centroid(nhoods)), st_coordinates(st_centroid(highrise.sf)), 1),
    highrise_nn2 = nn_function(st_coordinates(st_centroid(nhoods)), st_coordinates(st_centroid(highrise.sf)), 2), 
    highrise_nn3 = nn_function(st_coordinates(st_centroid(nhoods)), st_coordinates(st_centroid(highrise.sf)), 3), 
    highrise_nn4 = nn_function(st_coordinates(st_centroid(nhoods)), st_coordinates(st_centroid(highrise.sf)), 4), 
    highrise_nn5 = nn_function(st_coordinates(st_centroid(nhoods)), st_coordinates(st_centroid(highrise.sf)), 5))
nhoods %>%
  st_drop_geometry() %>%
  dplyr::select(price, starts_with("highrise_")) %>%
  filter(price <= 600) %>%
  gather(Variable, Value, -price) %>% 
   ggplot(aes(Value, price)) +
     geom_point(size = .5, color = "#2A9D8F", alpha = 0.35) + 
     geom_smooth(data = . %>% filter(price > 0), method = "lm", se=F, colour = "#005B96") +
     facet_wrap(~Variable, nrow = 1, scales = "free") +
     labs(title = "Price as a function of distance to high-rise") +
     theme_light()

# 2.green roofs
nhoods$greenroof.Buffer <- nhoods %>% 
    st_buffer(660) %>% 
    aggregate(mutate(greenroof.sf, counter = 1),., sum) %>%
    pull(counter)
nhoods <-
  nhoods %>% 
  mutate(
    greenroof_nn1 = nn_function(st_coordinates(st_centroid(nhoods)), st_coordinates(st_centroid(greenroof.sf)), 1),
    greenroof_nn2 = nn_function(st_coordinates(st_centroid(nhoods)), st_coordinates(st_centroid(greenroof.sf)), 2), 
    greenroof_nn3 = nn_function(st_coordinates(st_centroid(nhoods)), st_coordinates(st_centroid(greenroof.sf)), 3), 
    greenroof_nn4 = nn_function(st_coordinates(st_centroid(nhoods)), st_coordinates(st_centroid(greenroof.sf)), 4), 
    greenroof_nn5 = nn_function(st_coordinates(st_centroid(nhoods)), st_coordinates(st_centroid(greenroof.sf)), 5))
nhoods %>%
  st_drop_geometry() %>%
  dplyr::select(price, starts_with("greenroof_")) %>%
  filter(price <= 600) %>%
  gather(Variable, Value, -price) %>% 
   ggplot(aes(Value, price)) +
     geom_point(size = .5, color = "#2A9D8F", alpha = 0.35) + 
     geom_smooth(data = . %>% filter(price > 0), method = "lm", se=F, colour = "#005B96") +
     facet_wrap(~Variable, nrow = 1, scales = "free") +
     labs(title = "Price as a function of distance to green roofs") +
     theme_light()

# 3.wall art
nhoods$wallart.Buffer <- nhoods %>% 
    st_buffer(660) %>% 
    aggregate(mutate(wallart.sf, counter = 1),., sum) %>%
    pull(counter)
nhoods <-
  nhoods %>% 
  mutate(
    wallart_nn1 = nn_function(st_coordinates(st_centroid(nhoods)), st_coordinates(st_centroid(wallart.sf)), 1),
    wallart_nn2 = nn_function(st_coordinates(st_centroid(nhoods)), st_coordinates(st_centroid(wallart.sf)), 2), 
    wallart_nn3 = nn_function(st_coordinates(st_centroid(nhoods)), st_coordinates(st_centroid(wallart.sf)), 3), 
    wallart_nn4 = nn_function(st_coordinates(st_centroid(nhoods)), st_coordinates(st_centroid(wallart.sf)), 4), 
    wallart_nn5 = nn_function(st_coordinates(st_centroid(nhoods)), st_coordinates(st_centroid(wallart.sf)), 5))
nhoods %>%
  st_drop_geometry() %>%
  dplyr::select(price, starts_with("wallart_")) %>%
  filter(price <= 600) %>%
  gather(Variable, Value, -price) %>% 
   ggplot(aes(Value, price)) +
     geom_point(size = .5, color = "#2A9D8F", alpha = 0.35) + 
     geom_smooth(data = . %>% filter(price > 0), method = "lm", se=F, colour = "#005B96") +
     facet_wrap(~Variable, nrow = 1, scales = "free") +
     labs(title = "Price as a function of distance to wall art") +
     theme_light()

# 4.market data
nhoods$market.Buffer <- nhoods %>% 
    st_buffer(660) %>% 
    aggregate(mutate(market.sf, counter = 1),., sum) %>%
    pull(counter)
nhoods <-
  nhoods %>% 
  mutate(
    market_nn1 = nn_function(st_coordinates(st_centroid(nhoods)), st_coordinates(st_centroid(market.sf)), 1),
    market_nn2 = nn_function(st_coordinates(st_centroid(nhoods)), st_coordinates(st_centroid(market.sf)), 2), 
    market_nn3 = nn_function(st_coordinates(st_centroid(nhoods)), st_coordinates(st_centroid(market.sf)), 3), 
    market_nn4 = nn_function(st_coordinates(st_centroid(nhoods)), st_coordinates(st_centroid(market.sf)), 4), 
    market_nn5 = nn_function(st_coordinates(st_centroid(nhoods)), st_coordinates(st_centroid(market.sf)), 5))
nhoods %>%
  st_drop_geometry() %>%
  dplyr::select(price, starts_with("market_")) %>%
  filter(price <= 600) %>%
  gather(Variable, Value, -price) %>% 
   ggplot(aes(Value, price)) +
     geom_point(size = .5, color = "#2A9D8F", alpha = 0.35) + 
     geom_smooth(data = . %>% filter(price > 0), method = "lm", se=F, colour = "#005B96") +
     facet_wrap(~Variable, nrow = 1, scales = "free") +
     labs(title = "Price as a function of distance to markets") +
     theme_light()

# 5.swimming pool data
nhoods$swimming.Buffer <- nhoods %>% 
    st_buffer(660) %>% 
    aggregate(mutate(swimming.sf, counter = 1),., sum) %>%
    pull(counter)
nhoods <-
  nhoods %>% 
  mutate(
    swimming_nn1 = nn_function(st_coordinates(st_centroid(nhoods)), st_coordinates(st_centroid(swimming.sf)), 1),
    swimming_nn2 = nn_function(st_coordinates(st_centroid(nhoods)), st_coordinates(st_centroid(swimming.sf)), 2), 
    swimming_nn3 = nn_function(st_coordinates(st_centroid(nhoods)), st_coordinates(st_centroid(swimming.sf)), 3), 
    swimming_nn4 = nn_function(st_coordinates(st_centroid(nhoods)), st_coordinates(st_centroid(swimming.sf)), 4), 
    swimming_nn5 = nn_function(st_coordinates(st_centroid(nhoods)), st_coordinates(st_centroid(swimming.sf)), 5))
nhoods %>%
  st_drop_geometry() %>%
  dplyr::select(price, starts_with("swimming_")) %>%
  filter(price <= 600) %>%
  gather(Variable, Value, -price) %>% 
   ggplot(aes(Value, price)) +
     geom_point(size = .5, color = "#2A9D8F", alpha = 0.35) + 
     geom_smooth(data = . %>% filter(price > 0), method = "lm", se=F, colour = "#005B96") +
     facet_wrap(~Variable, nrow = 1, scales = "free") +
     labs(title = "Price as a function of distance to swimming pools") +
     theme_light()

#6 Tram & Metro


nhoods$tram.Buffer <- nhoods %>% 
    st_buffer(660) %>% 
    aggregate(mutate(tram_metro.sf, counter = 1),., sum) %>%
    pull(counter)
nhoods <-
  nhoods %>% 
  mutate(
    tram_nn1 = nn_function(st_coordinates(st_centroid(nhoods)), st_coordinates(st_centroid(tram_metro.sf)), 1),
    tram_nn2 = nn_function(st_coordinates(st_centroid(nhoods)), st_coordinates(st_centroid(tram_metro.sf)), 2), 
    tram_nn3 = nn_function(st_coordinates(st_centroid(nhoods)), st_coordinates(st_centroid(tram_metro.sf)), 3), 
    tram_nn4 = nn_function(st_coordinates(st_centroid(nhoods)), st_coordinates(st_centroid(tram_metro.sf)), 4), 
    tram_nn5 = nn_function(st_coordinates(st_centroid(nhoods)), st_coordinates(st_centroid(tram_metro.sf)), 5))
nhoods %>%
  st_drop_geometry() %>%
  dplyr::select(price, starts_with("tram_")) %>%
  filter(price <= 1000000) %>%
  gather(Variable, Value, -price) %>% 
   ggplot(aes(Value, price)) +
     geom_point(size = .5, color = "#2A9D8F", alpha = 0.35) + 
     geom_smooth(data = . %>% filter(price > 0), method = "lm", se=F, colour = "#005B96") +
     facet_wrap(~Variable, nrow = 1, scales = "free") +
     labs(title = "Price as a function of Distance to Tram & Metro Stations") +
     theme_light()

(2) Analysing Correlations

Based on the above analysis, we continued to conduct correlation analysis to determine which variables are important factors affecting Airbnb house prices. Among them, greenroof_nn3, wallart_nn3, market_nn2, tram_nn3, age and AvrVal are more significant.

ggcorrplot(
  round(cor(nhoods%>%select(highrise_nn1,greenroof_nn3,wallart_nn3,market_nn2,swimming_nn3,tram_nn3,price,age,AvrVal)%>%st_drop_geometry()%>%na.omit()), 1), 
  p.mat = cor_pmat(nhoods%>%select(highrise_nn1,greenroof_nn3,wallart_nn3,market_nn2,swimming_nn3,tram_nn3,price,age,AvrVal)%>%st_drop_geometry()%>%na.omit()),
  colors = c("#E63946", "white", "#2A9D8F"),
  type="lower",
  insig = "blank",
   lab = TRUE) +  
    labs(title = "Correlation across distance factors") 

#select greenroof_nn3,wallart_nn3,market_nn2,tram_nn3,age,AvrVal

2.5 Load Census Data

In this section, we integrate census data to enrich our analysis of Airbnb listings. Census data can provide valuable background information about the communities where these properties are located, including demographic, economic and social characteristics. This additional layer of data allows for a more complete understanding of the factors that influence the desirability and pricing of Airbnb properties. Our data comes from the 2019 CBS census. This document is similar in structure to the American Community Survey and can provide critical support.

# Load acs data, need data selection here

ams_census <- 
  st_read("https://raw.githubusercontent.com/ObjQIAN/Airbnb_AMS/main/data/ams_census.geojson") %>%
   st_transform('EPSG:7415') %>%
  rename(residents = aantal_inwoners,
         male_residents = aantal_mannen,
         famale_residents = aantal_vrouwen,
         pct_dutch_bkg = percentage_nederlandse_achtergrond,
         pct_wes_bkg = percentage_westerse_migr_achtergr,
         pct_nonwes_bkg = percentage_niet_westerse_migr_achtergr,
         avr_fam_size = gemiddelde_huishoudensgrootte,
         low_income_pct = percentage_laag_inkomen_huishouden,
         high_income_pct = percentage_hoog_inkomen_huishouden,
         median_HH = mediaan_inkomen_huishouden,
         rental_home =percentage_huurwoningen,
        # rental_corp = aantal_huurwoningen_in_bezit_woningcorporaties,
         owned_pct = percentage_koopwoningen,
         not_ocp_home =aantal_niet_bewoonde_woningen,
         
         gas_using = gemiddeld_gasverbruik_woning,
        # zip_house_value = gemiddelde_woz_waarde_woning,
         
         elec_using = gemiddeld_elektriciteitsverbruik_woning,
         dis_to_train = dichtstbijzijnde_treinstation_afstand_in_km,
         dis_to_sub = dichtstbijzijnde_overstapstation_afstand_in_km,
         fire_station = dichtstbijzijnde_brandweerkazerne_afstand_in_km,
         daycare_num = kinderdagverblijf_aantal_binnen_3_km,
         )%>%
  dplyr::select(postcode4,daycare_num,fire_station,dis_to_sub,dis_to_train,elec_using,gas_using,not_ocp_home,owned_pct,rental_home,median_HH,high_income_pct,low_income_pct,avr_fam_size,pct_nonwes_bkg,pct_wes_bkg,pct_dutch_bkg,famale_residents,male_residents,residents)

ams_census$median_HH_num <- ifelse(ams_census$median_HH == "00-20 laag", 1,
                                   ifelse(ams_census$median_HH == "20-40 onder midden", 2,
                                          ifelse(ams_census$median_HH == "20-60 onder midden tot midden", 3,
                                                 ifelse(ams_census$median_HH == "40-60 midden", 4,
                                                        ifelse(ams_census$median_HH == "40-80 midden tot boven midden", 5,
                                                               ifelse(ams_census$median_HH == "60-100 boven midden tot hoog", 6,
                                                                      ifelse(ams_census$median_HH == "60-80 boven midden", 7,
                                                                             ifelse(ams_census$median_HH == "80-100 hoog", 8, NA))))))))
nhoods <- st_join(nhoods,ams_census)%>% filter(postcode4 != 1101 & postcode4 != 1043 )
#daycare_num,fire_station,dis_to_sub,dis_to_train,elec_using,zip_house_value,gas_using,not_ocp_home,owned_pct,rental_corp,rental_home,median_HH,high_income_pct,low_income_pct,avr_fam_size,pct_nonwes_bkg,pct_wes_bkg,pct_dutch_bkg,famale_residents,male_residents,residents

Analysing Correlations

Use ggcorrplot to perform correlation analysis on the results. We selected postcode4, daycare_num, gas_using, owned_pct, median_HH_num, pct_dutch_bkg, residents, price for modeling

census_les <-nhoods%>%st_drop_geometry()%>%select(postcode4,daycare_num,fire_station,dis_to_sub,dis_to_train,elec_using,gas_using,not_ocp_home,owned_pct,rental_home,median_HH_num,high_income_pct,low_income_pct,avr_fam_size,pct_nonwes_bkg,pct_wes_bkg,pct_dutch_bkg,residents,price)%>%na.omit()
census_les<- transform(census_les,postcode4 = as.numeric(postcode4)) %>% filter(postcode4 != 1101 & postcode4 != 1043 )



ggcorrplot(
  round(cor(census_les), 1), 
  p.mat = cor_pmat(census_les),
  colors = c("#E63946", "white", "#2A9D8F"),
  type="lower",
  insig = "blank",
   #lab = TRUE
  ) +  
    labs(title = "Correlation across census data") 

census_les %>% 
  gather(Variable, Value, -price) %>% 
   ggplot(aes(Value, price)) +
     geom_point(size = .5, color = "#2A9D8F", alpha = 0.45) + 
    geom_smooth(data = . %>% filter(price >0), method = "lm", se=F, colour = "#005B96") +
     facet_wrap(~Variable, ncol = 5, scales = "free") +
     labs(title = "Price as a function of census values") +
     plotTheme()

ggcorrplot(
  round(cor(census_les%>%select(postcode4,daycare_num,dis_to_sub,elec_using,gas_using,owned_pct,median_HH_num,avr_fam_size,pct_wes_bkg,pct_dutch_bkg,residents,price)), 1), 
  p.mat = cor_pmat(census_les%>%select(postcode4,daycare_num,dis_to_sub,elec_using,gas_using,owned_pct,median_HH_num,avr_fam_size,pct_wes_bkg,pct_dutch_bkg,residents,price)),
  colors = c("#E63946", "white", "#2A9D8F"),
  type="lower",
  insig = "blank",
   lab = TRUE) +  
    labs(title = "Correlation across census values") 

#select postcode4,daycare_num,dis_to_sub,elec_using,gas_using,owned_pct,median_HH_num,avr_fam_size,pct_wes_bkg,pct_dutch_bkg,residents,price

3 Model Building

In this section, we enter the model building phase. Our goal is to develop a predictive model that can accurately predict Airbnb listing prices based on various factors identified during the data analysis stage. The models we will explore include baseline models, neighborhood models, lag models, and models that include neighborhood effects and price lag as well as sentiment factors.

3.1 Data Selections

In the data selection stage, we select the most relevant and influential data for analysis through the results of the correlation scores above, laying the foundation for our modeling process. This step is crucial, ‘garbage in, garbage out’. Because the accuracy and generalization of our predictive models depends heavily on the quality and appropriateness of the selected data.

listing_details<- listing_details%>%st_drop_geometry()%>%select(-room_type,-reviews_per_month,-minimum_nights)

nhoods_sub <- left_join(nhoods,listing_details,  by = 'id')%>% filter(price < 300)%>%
  select(id,price,cancellation_policy,bed_type,property_type,room_type,accommodates,bathrooms,bedrooms,cleaning_fee,amenities_lengths,extra_people,reviews_per_month,review_polarity,review_scores_rating,greenroof_nn3,wallart_nn3,market_nn2,tram_nn3,age,AvrVal,postcode4,daycare_num,dis_to_sub,elec_using,gas_using,owned_pct,median_HH_num,avr_fam_size,pct_wes_bkg,pct_dutch_bkg,residents,neighbourhood) %>%na.omit()

nhoods_sub <-transform(nhoods_sub, id = as.numeric(id)) 
#,postcode4 = as.numeric(postcode4)
numericVars <- 
  select_if(st_drop_geometry(nhoods_sub), is.numeric) %>% na.omit() 

3.1 Neighborhood Effects

The neighborhood effects on Airbnb prices is studied. Specifically, we employ Moran’s I statistic to test for spatial autocorrelation in listing prices. Spatial autocorrelation refers to the degree to which objects close to one another are similar in value—in this case. The result shows that the price is effected, this also indicates that we should consider the impact of spatial lag in the subsequent modeling process..

coords <- st_coordinates(nhoods_sub) 

neighborList <- knn2nb(knearneigh(coords, 4))

spatialWeights <- nb2listw(neighborList, style="W")

nhoods_sub$lagPrice <- lag.listw(spatialWeights, nhoods_sub$price)



moranTest <- moran.mc(nhoods_sub$price, 
                      spatialWeights, nsim = 999)

ggplot(as.data.frame(moranTest$res[c(1:999)]), aes(moranTest$res[c(1:999)])) +
  geom_histogram(binwidth = 0.01) +
  geom_vline(aes(xintercept = moranTest$statistic), colour = "#E63946",size=1) +
  scale_x_continuous(limits = c(-1, 1)) +
  labs(title="Observed and permuted Moran's I",
       subtitle= "Observed Moran's I in orange",
       x="Moran's I",
       y="Count") +
  plotTheme()

3.2 Unique Factor combinations

In order to make the prediction results more accurate, we tested combinations of different variables. Most of them had no significant impact on the results, but we left

per_bathroom | A function of bathrooms and accommodations, this may represent the adequacy of accommodations in a cheap listing,

fixed_score uses lagPrice and the review quality of this listing to infer the price model. In a high-priced area, if a listing has higher reviews, it may mean that its price is lower.

#额外加了review_scores_rating
nhoods_sub$per_bathroom <- nhoods_sub$bathroom / nhoods_sub$accommodates
#current price value
nhoods_sub <- nhoods_sub%>%
  mutate(fixed_score = lagPrice*(0.1 +review_polarity)*(101-review_scores_rating)*bedrooms*market_nn2)

3.3 Model Building

We have conducted five distinct linear models, each utilizing different aspects of our comprehensive dataset. The goal is to understand and accurately predict listing prices, considering a variety of factors. By comparing the performance of these model, we could find the better way for our product. The models are:

Baseline Model | This model serves as our foundation. It incorporates fundamental features Neighborhoods Model | This model adds impact of neighborhood characteristics on listing prices Zip Model | This model zooms adds impact of importance of postal code areas. Spatial Lag Model | This model adds how the prices of nearby listings (spatial lag) influence a particular listing’s price. Neighborhoods with LagPrice Model | It combines the insights from neighborhood characteristics with spatial lag considerations

data_to_train <- nhoods_sub %>%
  select(price,cancellation_policy,bed_type,room_type,accommodates,bathrooms,bedrooms,cleaning_fee,amenities_lengths,extra_people,reviews_per_month,review_polarity,greenroof_nn3,wallart_nn3,market_nn2,tram_nn3,age,AvrVal,postcode4,daycare_num,dis_to_sub,gas_using,owned_pct,median_HH_num,pct_wes_bkg,pct_dutch_bkg,residents,per_bathroom,fixed_score,neighbourhood,lagPrice)

set.seed(825)
inTrain <- createDataPartition(
              y = paste(data_to_train$room_type), 
              p = .70, list = FALSE)
ams.training <- data_to_train[inTrain,]
ams.test <- data_to_train[-inTrain,] 

ams.training.baseline <- ams.training %>%
  select(-postcode4,-neighbourhood,-lagPrice,-per_bathroom,-fixed_score)

ams.training.nhoods <- ams.training %>%
  select(-postcode4,-lagPrice,-per_bathroom,-fixed_score)

ams.training.zip <- ams.training %>%
  select(-neighbourhood,-lagPrice,-per_bathroom,-fixed_score)

ams.training.lag <- ams.training %>%
  select(-postcode4,-neighbourhood)

ams.training.hoods_lag <- ams.training 

#temp

reg.baseline <-
   lm(price ~ ., data = st_drop_geometry(ams.training.baseline))

reg.nhoods <-
   lm(price ~ ., data = st_drop_geometry(ams.training.nhoods))

reg.zip <-
   lm(price ~ ., data = st_drop_geometry(ams.training.zip))

reg.lag <-
   lm(price ~ ., data = st_drop_geometry(ams.training.lag))

reg.hoods_lag <-
   lm(price ~ ., data = st_drop_geometry(ams.training.hoods_lag))




ams.test <-
  ams.test %>% 
  na.omit() 

ams.test.baseline <- ams.test %>%
  mutate(Regression = "Baseline Regression",
         listing_price.Predict = predict(reg.baseline, ams.test),
         listing_price.Error = listing_price.Predict - price,
         listing_price.AbsError = abs(listing_price.Predict - price),
         listing_price.APE = (abs(listing_price.Predict - price)) / listing_price.Predict) 

ams.test.nhoods <- ams.test %>%
  mutate(Regression = "Neighborhoods Regression",
         listing_price.Predict = predict(reg.nhoods, ams.test),
         listing_price.Error = listing_price.Predict - price,
         listing_price.AbsError = abs(listing_price.Predict - price),
         listing_price.APE = (abs(listing_price.Predict - price)) / listing_price.Predict) 

ams.test.zip <- ams.test %>%
  mutate(Regression = "Zip Regression",
         listing_price.Predict = predict(reg.zip, ams.test),
         listing_price.Error = listing_price.Predict - price,
         listing_price.AbsError = abs(listing_price.Predict - price),
         listing_price.APE = (abs(listing_price.Predict - price)) / listing_price.Predict) 

ams.test.lag <- ams.test %>%
  mutate(Regression = "Spatial Lag  Regression",
         listing_price.Predict = predict(reg.lag, ams.test),
         listing_price.Error = listing_price.Predict - price,
         listing_price.AbsError = abs(listing_price.Predict - price),
         listing_price.APE = (abs(listing_price.Predict - price)) / listing_price.Predict) 

ams.test.hoods_lag <- ams.test %>%
  mutate(Regression = "Neighborhoods with LagPrice  Regression",
         listing_price.Predict = predict(reg.hoods_lag, ams.test),
         listing_price.Error = listing_price.Predict - price,
         listing_price.AbsError = abs(listing_price.Predict - price),
         listing_price.APE = (abs(listing_price.Predict - price)) / listing_price.Predict) 


model_MAE_summaries <- bind_rows(
  ams.test.baseline %>% 
  st_drop_geometry() %>%
  summarise(MAE = mean(listing_price.AbsError),
            MAPE = mean(abs(listing_price.APE)*100)) %>% mutate(model = "Baseline Model"),
  
    ams.test.nhoods %>% 
  st_drop_geometry() %>%
  summarise(MAE = mean(listing_price.AbsError),
            MAPE = mean(abs(listing_price.APE)*100)) %>% mutate(model = "Neighborhoods Model"),
  
    ams.test.zip %>% 
  st_drop_geometry() %>%
  summarise(MAE = mean(listing_price.AbsError),
            MAPE = mean(abs(listing_price.APE)*100)) %>% mutate(model = "Zip Model"),
  
    ams.test.lag %>% 
  st_drop_geometry() %>%
  summarise(MAE = mean(listing_price.AbsError),
            MAPE = mean(abs(listing_price.APE)*100)) %>% mutate(model = "Spatial Lag Model"),
  
    ams.test.hoods_lag %>% 
  st_drop_geometry() %>%
  summarise(MAE = mean(listing_price.AbsError),
            MAPE = mean(abs(listing_price.APE)*100)) %>% mutate(model = "Neighborhoods with LagPrice Model"),
  

)


kable(model_MAE_summaries, caption = 'Table 4.1 Summary of Model Performance') %>%
  kable_styling("striped", full_width = F)
Table 4.1 Summary of Model Performance
MAE MAPE model
29.60733 22.50012 Baseline Model
29.18136 22.06071 Neighborhoods Model
29.19889 22.01905 Zip Model
29.40483 22.36386 Spatial Lag Model
29.02514 21.91027 Neighborhoods with LagPrice Model

3.4 Model Performance

We also tested the model on the test set - and here are the result of them.

The overall result is good with a 21% MAPE and 0.45 of R Square. The difference among the result are rather the same, which could indicate the model did not handle neighborhood.

model_summaries <- bind_rows(
  glance( reg.baseline) %>% mutate(model = "Baseline Model"),
  glance(reg.nhoods) %>% mutate(model = "Neighborhoods Model"),
    glance(reg.zip) %>% mutate(model = "Zip Model"),
    glance(reg.lag) %>% mutate(model = "Spatial Lag Model"),
    glance(reg.hoods_lag) %>% mutate(model = "Neighborhoods with LagPrice Model")
)

# Create the table
kable(model_summaries, caption = 'Table 4.2 Summary of Model Evaluation Parameters') %>%
  kable_styling("striped", full_width = F)
Table 4.2 Summary of Model Evaluation Parameters
r.squared adj.r.squared sigma statistic p.value df logLik AIC BIC deviance df.residual nobs model
0.4336975 0.4321463 39.39462 279.58127 0 31 -57779.53 115625.1 115867.2 17563258 11317 11349 Baseline Model
0.4406986 0.4381239 39.18672 171.16585 0 52 -57708.94 115525.9 115922.1 17346126 11296 11349 Neighborhoods Model
0.4453611 0.4407781 39.09406 97.17722 0 93 -57661.44 115512.9 116209.9 17201525 11255 11349 Zip Model
0.4395207 0.4378364 39.19675 260.94982 0 34 -57720.88 115513.8 115777.9 17382658 11314 11349 Spatial Lag Model
0.4518056 0.4462919 38.90085 81.94296 0 113 -57595.12 115420.2 116264.0 17001655 11235 11349 Neighborhoods with LagPrice Model
plot_summs(reg.baseline, scale = TRUE)

4 Result and Testing

To better exam the model and explore the real world impact, we made further testing to the model.

4.1 MAPE Visualizations

The first step to assess the performance of our most comprehensive model, the “Neighborhoods with LagPrice Model”. A distribution map of MAPE allows us to observe how well the model’s predictions align with actual prices across different areas of Amsterdam.

ggplot(ams.test.hoods_lag%>% filter(listing_price.APE<=1)) +
  geom_sf(aes(colour = listing_price.APE)) +
  scale_colour_viridis_c() +
  geom_sf(data = district, fill = NA) +
  labs(title = "MAPE Distribution",
       subtitle = 'Amsterdam, 2019') +
  mapTheme() 

4.2 Model Errors

Then we created a scatter plot to compare the actual listing prices against the predicted prices from the “Neighborhoods with LagPrice Model.” This visualization shows that the original data is relatively dispersed, can our model stocked more on the MAEs rather than the explanation of trends.

ggplot(ams.test.hoods_lag, aes(x = price, y = listing_price.Predict)) +
  geom_point(alpha = 0.2,color = "#2A9D8F") +
  labs(title = "Predicted price vs real listing price",
       subtitle = "Based on neighborhood-lagprice model",
       x = "Actual listing",
       y = "Predicted listings") +
  geom_abline() +
  geom_smooth(method = "lm", se = FALSE, color ='#E63946' ) +
  theme_minimal()

4.3 Did the spatial hood-lag model work for neighborhood effects?

We aimed to evaluate the effectiveness of the spatial hood-lag model. To do this, we used Moran’s I statistic again to test for spatial autocorrelation in the prediction errors of the model.

Spatial autocorrelation in this context would mean that the model’s errors are totaly randomly distributed across space and our model is effective.

coords <- st_coordinates(ams.test.hoods_lag) 

neighborList <- knn2nb(knearneigh(coords, 4))

spatialWeights.nhoods <- nb2listw(neighborList, style="W")

moranTest.nhoods_lag <- moran.mc(ams.test.hoods_lag$listing_price.Error, 
                      spatialWeights.nhoods, nsim = 999)

ggplot(as.data.frame(moranTest.nhoods_lag$res[c(1:999)]), aes(moranTest.nhoods_lag$res[c(1:999)])) +
  geom_histogram(binwidth = 0.01) +
  geom_vline(aes(xintercept = moranTest.nhoods_lag$statistic), colour = '#2A9D8F',size=1) +
  scale_x_continuous(limits = c(-1, 1)) +
  labs(title="Moran's I for Prediction Result Errors",
       subtitle= "Based on Prediction Error for Neighborhoods with LagPrice Model",
       x="Moran's I",
       y="Count") +
  plotTheme()

4.4 Comprehensive Regression Analysis

In this section, we are consolidating the results from all the regression models to compare their performance across different metrics. This approach allows us to assess each model’s effectiveness in predicting Airbnb listing prices and to understand how they perform relative to each other.

Among them, Neighborhoods with LagPrice Model, which combines neighborhood characteristics with spatial lag, shows the best performance among all models. It underscores the importance of considering both local attributes and the influence of nearby listings.

AllRegressions <- 
  rbind(
    dplyr::select(ams.test.baseline, starts_with("listing_"), Regression, neighbourhood, price) %>%
      mutate(lagPriceError = lag.listw(spatialWeights.nhoods, listing_price.Error)),
    
    dplyr::select(ams.test.nhoods, starts_with("listing_"), Regression, neighbourhood, price) %>%
      mutate(lagPriceError = lag.listw(spatialWeights.nhoods, listing_price.Error)),
    
    dplyr::select(ams.test.zip, starts_with("listing_"), Regression, neighbourhood, price) %>%
      mutate(lagPriceError = lag.listw(spatialWeights.nhoods, listing_price.Error)),
    
    dplyr::select(ams.test.lag, starts_with("listing_"), Regression, neighbourhood, price) %>%
      mutate(lagPriceError = lag.listw(spatialWeights.nhoods, listing_price.Error)),
    
    dplyr::select(ams.test.hoods_lag, starts_with("listing_"), Regression, neighbourhood, price) %>%
      mutate(lagPriceError = lag.listw(spatialWeights.nhoods, listing_price.Error)))  

AllRegressions <- na.omit(AllRegressions)
st_drop_geometry(AllRegressions) %>%
  gather(Variable, Value, -Regression, -neighbourhood) %>%
  filter(Variable == "listing_price.AbsError" | Variable == "listing_price.APE") %>%
  group_by(Regression, Variable) %>%
    summarize(meanValue = mean(Value, na.rm = T)) %>%
    spread(Variable, meanValue) %>%
    kable(caption = 'Table 4.3 Summary of Model Errors')%>%
    kable_styling("striped", full_width = F)
Table 4.3 Summary of Model Errors
Regression listing_price.AbsError listing_price.APE
Baseline Regression 29.60733 0.2250012
Neighborhoods Regression 29.18136 0.2206071
Neighborhoods with LagPrice Regression 29.02514 0.2191027
Spatial Lag Regression 29.40483 0.2236386
Zip Regression 29.19889 0.2201905

Model Performance - what are key facators that impact the accuracy??

Then we visualized the relationship between various independent variables and the prediction errors (Absolute Error) of the “Neighborhoods with LagPrice” model. This clearly shows some weakness of our model - when dealing with luxury houses and very large properties (especially having large number of bathrooms and bedrooms), the accuracy drops a lot.

ams.test.hoods_lag%>% st_drop_geometry() %>% select(accommodates,bathrooms,bedrooms,cleaning_fee,lagPrice,amenities_lengths,extra_people,reviews_per_month,review_polarity,greenroof_nn3,owned_pct,pct_dutch_bkg,listing_price.AbsError)%>% na.omit()%>%
  gather(Variable, Value, -listing_price.AbsError) %>% 
   ggplot(aes(Value, listing_price.AbsError)) +
     geom_point(size = .5, color = "#2A9D8F", alpha = 0.45) + 
    geom_smooth( data = . %>% filter(listing_price.AbsError > 0), method = "lm", se=F, colour = "#005B96") +
     facet_wrap(~Variable, ncol = 4, scales = "free") +
     labs(title = "Price AbsError as a function of variables") +
     plotTheme()

4.5 Comparative Model Performance Visualization

In this section, we’re visualizing and comparing the performance of the various regression models developed. The aim is to understand how well each model predicts Airbnb listing prices compared to the actual prices. Based on the result, all of our model are fantastic - as a linear model, It can not predict all the detail of data change, but provide significant predictive power, especially when dealing with large and diverse data sets like ours. Our model shows that even though linear regression itself is simple, we can achieve high accuracy in predicting Airbnb listing prices.

If our raw data were less dispersed, other predictors would improve significantly.

AllRegressions%>% st_drop_geometry() %>%
  dplyr::select(listing_price.Predict, price, Regression) %>%
    ggplot(aes(price, listing_price.Predict)) +
  geom_point(color = '#005B96',alpha = 0.1) +
  stat_smooth(aes(price, price), 
             method = "lm", se = FALSE, size = 1, colour="#E63946") + 
  stat_smooth(aes(listing_price.Predict, price), 
              method = "lm", se = FALSE, size = 1, colour="#2A9D8F") +
  facet_wrap(~Regression) +
  labs(title="Predicted sale price as a function of observed price",
       subtitle="Red line represents a perfect prediction; Green line represents prediction") +
  plotTheme()

4.6 Cross-Validation of the Regression Model

We started evaluating the robustness and reliability of our regression model through cross-validation.

Cross-validation is a statistical method used to estimate the skill of machine learning models. Our model shows a great accuracy, narrow distribution centered around a MAE value of 30 has indicated high reliability and accuracy.

fitControl <- trainControl(method = "cv", number = 100)
set.seed(825)

reg.cv <- 
  train(price ~ ., data = st_drop_geometry(data_to_train), 
        method = "lm", trControl = fitControl, na.action = na.pass)


ggplot(data = reg.cv$resample) +
  geom_histogram(aes(x = reg.cv$resample$MAE), fill = '#2A9D8F') +
  labs(title="Distribution of Cross-validation MAE",
       subtitle = "K = 100\n",
       caption = "Figure RESULT 4.2") +
  xlab('MAE of Model') +
  ylab('Count') +
  plotTheme()

4.7 Result visualizations

Neighborhood Comparison

In this section, we’re focusing on visualizing the results of our regression models in relation to different neighborhoods. This approach allows us to understand how the performance of each model varies across various areas in Amsterdam. We use census data ( ams_census ) to classify neighborhoods according to racial background, family background, and income level, and then evaluate the model’s performance in these different contexts.

st_drop_geometry(AllRegressions) %>%
  group_by(Regression, neighbourhood) %>%
  summarize(mean.MAPE = mean(abs(listing_price.APE * 100), na.rm = T)) %>%
  ungroup() %>% 
  left_join(district) %>%
    st_sf() %>%
    ggplot() + 
      geom_sf(aes(fill = mean.MAPE)) +
      geom_sf(data = AllRegressions, colour = '#005B96', size = .1,alpha = 0.2) +
      facet_wrap(~Regression) +
      scale_fill_gradient(low = palette5[4], high = palette5[2],
                          name = "Mean Absolute Percent Error") +
      labs(title = "Mean test set MAPE by neighborhood") +
      mapTheme()

Model Performance Based on Socioeconomic Background

The following series of code snippets aims to understand how our model performs in different socio-economic contexts in Amsterdam. We use census data (ams_census) to classify neighborhoods according to racial background, family background, and income level, and then evaluate the model’s performance in these different contexts.

tracts17 <- ams_census %>%
  select(median_HH, pct_dutch_bkg, avr_fam_size)%>%
  mutate(raceContext = ifelse(pct_dutch_bkg > 50, "Majority Dutch", "International Zone"),
         famContext = ifelse(avr_fam_size > 2, "Large Fams", "Small Fams"))

grid.arrange(ncol = 2,
  ggplot() + geom_sf(data = na.omit(tracts17), aes(fill = raceContext)) +
    scale_fill_manual(values = c("#2A9D8F", "#E63946"), name="Race Context") +
    labs(title = "Race Context") +
    mapTheme() + theme(legend.position="bottom"), 
  ggplot() + geom_sf(data = na.omit(tracts17), aes(fill = famContext)) +
    scale_fill_manual(values = c("#2A9D8F", "#E63946"), name="Family Context") +
    labs(title = "Family Context") +
    mapTheme() + theme(legend.position="bottom"),
  ggplot() + geom_sf(data = na.omit(tracts17), aes(fill = median_HH)) +
    scale_fill_manual(values = c("#2a9d8f", "#458f85", "#60807a", "#7b7270", "#956465", "#b0565b", "#cb4750", "#e63946"), name="Income Context") +
    labs(title = "Income Context") +
    mapTheme() + theme(legend.position="bottom"))

This code combines the regression results with the racial background data and summarizes the MAPE for each model in each racial background. The results show that our model performs consistently in different ethnic gathering areas and exhibits high generality.

st_join(AllRegressions, tracts17) %>% 
  group_by(Regression, raceContext) %>%
  summarize(mean.MAPE = scales::percent(mean(listing_price.APE, na.rm = T))) %>%
  st_drop_geometry() %>%
  spread(raceContext, mean.MAPE) %>%
  kable(caption = "Test set MAPE by neighborhood racial context")
Test set MAPE by neighborhood racial context
Regression International Zone Majority Dutch
Baseline Regression 23% 22%
Neighborhoods Regression 22% 22%
Neighborhoods with LagPrice Regression 22% 21%
Spatial Lag Regression 23% 22%
Zip Regression 22% 22%

This code combines the regression results with household size background data and summarizes the MAPE for each model in each size context. The results show that our model performs better in areas where smaller families gather, which may indicate urban centers or areas where young people gather.

st_join(AllRegressions, tracts17) %>% 
  filter(!is.na(famContext)) %>%
  group_by(Regression, famContext) %>%
  summarize(mean.MAPE = scales::percent(mean(listing_price.APE, na.rm = T))) %>%
  st_drop_geometry() %>%
  spread(famContext, mean.MAPE) %>%
  kable(caption = "Test set MAPE by neighborhood Family context")
Test set MAPE by neighborhood Family context
Regression Large Fams Small Fams
Baseline Regression 29% 22%
Neighborhoods Regression 27% 22%
Neighborhoods with LagPrice Regression 26% 22%
Spatial Lag Regression 28% 22%
Zip Regression 26% 22%

This code combines the regression results with the income context data and summarizes the MAPE for each model in the income context. The results show that our model performs relatively stably in different income regions.

st_join(AllRegressions, tracts17) %>% 
  filter(!is.na(median_HH)) %>%
  group_by(Regression, median_HH) %>%
  summarize(mean.MAPE = scales::percent(mean(listing_price.APE, na.rm = T))) %>%
  st_drop_geometry() %>%
  spread(median_HH, mean.MAPE) %>%
  kable(caption = "Test set MAPE by neighborhood income context")
Test set MAPE by neighborhood income context
Regression 20-40 onder midden 20-60 onder midden tot midden 40-60 midden 40-80 midden tot boven midden 60-100 boven midden tot hoog 60-80 boven midden 80-100 hoog
Baseline Regression 26% 22% 22% 25% 21% 23% 24%
Neighborhoods Regression 24% 22% 21% 22% 22% 23% 24%
Neighborhoods with LagPrice Regression 24% 22% 21% 31% 26% 23% 24%
Spatial Lag Regression 26% 22% 22% 25% 20% 23% 25%
Zip Regression 24% 22% 21% 31% 26% 23% 24%

5 Discussion & Conclusion

We analyzed Airbnb listings in Amsterdam using a variety of datasets and methodologies and derived some important insights. We have successfully developed and evaluated multiple linear models: Baseline Model, Neighborhoods Model, Zip Model, Spatial Lag Model, and Neighborhoods with LagPrice Model, providing a broad spectrum for comparison and understanding of listing prices. Each model provides a unique perspective on predicting Airbnb listing prices. Despite the simplicity of the linear model, our method shows considerable predictive accuracy, as evidenced by the cross-validation results and MAE distribution. But they may not fully capture the complex nonlinear relationships in the data. This is especially true when dealing with luxury listings and properties with a large number of rooms. At the same time, the dispersion of raw data creates challenges, especially in capturing trends and making accurate predictions of market outliers. Our approach combines traditional listing characteristics with spatial and socioeconomic factors, increasing the depth and accuracy of our analysis.

As a future direction, our plan is to further integrate this model into our plug-in. These insights can enhance the Airbnb user experience and, if incorporated with real-time data, will enhance the applicability and accuracy of the model. However, we also need to conduct more tests in cities other than Amsterdam to determine whether our model generalizes well. European cities will pose challenges to this model in terms of spatial pattern, distribution, design, and historical and cultural influence, but we are at least confident that it will show its talents in cities similar to Amsterdam.

References:

“R Document.” RDocumentation, www.rdocumentation.org/. Accessed 20 Dec. 2023.

Steif, Ken. Public Policy Analytics: Code and Context for Data Science in Government. CRC Press, Taylor & Francis Group, 2022.

Luo, Yuanhang, Xuanyu Zhou, and Yulian Zhou. “Predicting airbnb listing price across different cities.” (2019).

Rezazadeh Kalehbasti, Pouya, Liubov Nikolenko, and Hoormazd Rezaei. “Airbnb price prediction using machine learning and sentiment analysis.” International Cross-Domain Conference for Machine Learning and Knowledge Extraction. Cham: Springer International Publishing, 2021.